In a previous post (LINK HERE), I began an investigation into NYC Vehicle Collisions data (LINK HERE). This personal project is meant to help me (a) practice my geospatial data skills, (b) replicate the spatial dependency model used by Morris et al. (DATE) (LINK HERE), and (c) potentially extend that model to include a time component, which would theoretically allow me to model whether the congestion pricing zone has impacted vehicle collision rates in 2025. In the interest of breaking this project down into digestible and achievable chunks, I have decided that this, part 2 of 3, will focus on purely spatial modeling. Fortunately, Mitzi Morris published this (LINK HERE) guide to the topic, which I’ll be following.
I outline most of the requisite data sources for the project in the previous post, although for this post I’m adding some tract-level traffic/built environment data (fragmentation index (LINK HERE) and traffic volume), as well as some ACS covariates (percentage of commuters that use public transit, population, median income, median age). My process here is to summarize the point-level crash data to the census tract level and then join it with the covariates.
One note: the fragmentation index and traffic variables provided by Morris only cover 2095 tracts (out of 2315 listed). There were also some ACS variables with NA values in a couple rows. Because I’m not trying to save the world, just figure out how to model things, I simply removed these from the analysis. In a different world (or perhaps the future), I would either find more complete data, determine which tracts I’m actually okay with filtering out (e.g., parks?), or interpolate the data in some way.
Below, please find a data table with the final dataset that I’m using.
# Loading NYC Census Tracts: https://data.cityofnewyork.us/City-Government/2020-Census-Tracts/63ge-mke6/about_data
nyc_tracts <- read.csv("data/2020_Census_Tracts_20250606.csv", stringsAsFactors = FALSE) %>%
rename("geometry" = "the_geom")
nyc_tracts <- st_as_sf(nyc_tracts, wkt = "geometry", crs = 4326)
nyc_tracts <- nyc_tracts %>%
mutate(
area_sqm = Shape_Area / 27878400
) %>%
select(
geometry,
BoroCT2020,
BoroCode,
BoroName,
area_sqm,
GEOID
)
# Loading collisions data: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/about_data
filter <- c(0, NA)
crashes <- read.csv("data/Motor_Vehicle_Collisions_-_Crashes_20250605.csv") %>%
select(CRASH.DATE,
LATITUDE,
LONGITUDE,
NUMBER.OF.PERSONS.INJURED,
NUMBER.OF.PERSONS.KILLED,
NUMBER.OF.PEDESTRIANS.INJURED,
NUMBER.OF.PEDESTRIANS.KILLED) %>%
filter(! LATITUDE %in% filter,
! LONGITUDE %in% filter) %>%
rename(
"date" = "CRASH.DATE",
"lat" = "LATITUDE",
"long" = "LONGITUDE",
"persons_inj" = "NUMBER.OF.PERSONS.INJURED",
"persons_death" = "NUMBER.OF.PERSONS.KILLED",
"ped_inj" = "NUMBER.OF.PEDESTRIANS.INJURED",
"ped_death" = "NUMBER.OF.PEDESTRIANS.KILLED"
) %>%
mutate(
date = mdy(date)
) %>%
st_as_sf(coords = c("long","lat"), crs = 4326)
# Grabbing some census variables via Tidycensus, using Keyring for API Key privacy
tidycensus_api_key <- key_get(service = "tidycensus_API", username = "my_tidycensus")
census_api_key(tidycensus_api_key)
census_vars <- get_acs(state = "NY",
county = c("Bronx", "Kings", "New York", "Queens", "Richmond"),
geography = "tract",
variables = c(medincome = "B19013_001",
population = "B01003_001",
median_age = "B01002_001",
transport_baseline = "B08301_001",
transport_pubtransit = "B08301_010"),
geometry = FALSE,
keep_geo_vars = FALSE,
year = 2022,
output = "wide"
) %>%
mutate(
GEOID = as.numeric(GEOID),
median_income = medincomeE,
population = populationE,
median_age = median_ageE,
prop_pubtransit = transport_pubtransitE / transport_baselineE
) %>%
select(
GEOID,
median_income,
population,
median_age,
prop_pubtransit
)
# Associating CP Zone, Pre/Post, Treatment, Borough, and Census Tract w/ Observations
crashes <- crashes %>%
st_join(nyc_tracts, join = st_within) %>%
filter(! is.na(area_sqm))
# Aggregating/summarizing data to census tract level, no time component
crashes_tract <- crashes %>%
group_by(BoroCT2020) %>%
summarize(tot_crashes = n(),
area = mean(area_sqm)) %>%
ungroup() %>%
st_drop_geometry()
# Getting Fragmentation Index from: https://github.com/mitzimorris/geomed_2024/blob/main/data/nyc_study.geojson
frag_data = st_read(file.path("data", "nyc_study.geojson"), quiet = TRUE) %>%
st_drop_geometry() %>%
select(BoroCT2010, frag_index, traffic) %>%
mutate(
BoroCT2010 = as.numeric(BoroCT2010)
)
# Joining everything together, selecting only variables that I want
crashes_tracts_geo <- nyc_tracts %>%
right_join(crashes_tract,
by = "BoroCT2020") %>%
left_join(census_vars,
by = "GEOID") %>%
left_join(frag_data,
by = c("BoroCT2020" = "BoroCT2010")) %>%
select(! c(area)) %>%
filter(! if_any(everything(), is.na))
# Interactive Data Table for Display
crashes_dt <- crashes_tracts_geo %>%
st_drop_geometry() %>%
mutate(
area_sqm = round(area_sqm, 3),
prop_pubtransit = round(prop_pubtransit, 3)
)
datatable(crashes_dt,
extensions = 'Buttons',
filter = "top",
rownames = FALSE,
options = list(
autoWidth = TRUE,
scrollX = TRUE
),
class = 'compact',
escape = FALSE
) %>%
formatStyle(
columns = names(crashes_dt),
`white-space` = "nowrap",
`height` = "1.5em",
`line-height` = "1.5em"
)
Though I made some plots in the previous post, I’ll make some more here to hopefully illustrate (a) characteristics of the outcome variable and (b) visually explore how the predictor variables might be related to the outcome.
ggplot(data = crashes_tracts_geo, aes(x = tot_crashes)) +
geom_histogram(aes(y = after_stat(density)),
bins = 500,
color = "lightblue",
fill = "lightblue",
alpha = 0.75) +
geom_density(color = "#FFC600", linewidth = 1) +
theme_bw() +
labs(title = "Distribution Total Crashes per Census Tract (2024-25)",
x = "Number of Crashes",
y = "Density")
This distribution has a mean of 50, standard deviation of 35, and variance of 1225, which is not great news for the Poisson distribution (which assumes that mean = variance). We’ll see how that does
Not all the predictors have a super clear relationship with the outcome variable, but it’s a good idea to include them anyway.
It’s worth noting that, in my previous post, I looked at crashes per square mile, whereas this analysis simply uses the raw counts. I believe this is (a) because the Poisson model is for discrete count data and (b) the dependency models adjust for population. I’m not 100% clear on why they’re not using the per square mile measure, though.
crash_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "tot_crashes",
fill.scale = tm_scale_intervals(values = "brewer.yl_or_rd",
style = "jenks"),
fill.legend = tm_legend(title = "Total Crashes, 2024-25"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
crash_map
income_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "median_income",
fill.scale = tm_scale_intervals(values = "brewer.greens",
style = "jenks"),
fill.legend = tm_legend(title = "Median Income"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
medianage_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "median_age",
fill.scale = tm_scale_intervals(values = "brewer.blues",
style = "jenks"),
fill.legend = tm_legend(title = "Median Age"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
tmap_arrange(income_map,
medianage_map,
nrow = 1, ncol = 2)
prop_pubtransit_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "prop_pubtransit",
fill.scale = tm_scale_intervals(values = "brewer.purples",
style = "jenks",
value.na = "grey"),
fill.legend = tm_legend(title = "Proportion that Commute Using Public Transit"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
pop_per_sqm_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "population",
fill.scale = tm_scale_intervals(values = "brewer.yl_gn",
style = "jenks"),
fill.legend = tm_legend(title = "Population"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
tmap_arrange(prop_pubtransit_map,
pop_per_sqm_map,
nrow = 1, ncol = 2)
frag_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "frag_index",
fill.scale = tm_scale_intervals(values = "brewer.pu_or",
style = "quantile",
midpoint = 0,
value.na = "grey"),
fill.legend = tm_legend(title = "Fragmentation Index"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
traffic_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "traffic",
fill.scale = tm_scale_intervals(values = "brewer.reds",
style = "quantile",
midpoint = 0,
value.na = "grey"),
fill.legend = tm_legend(title = "Traffic Volume"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
tmap_arrange(frag_map,
traffic_map,
nrow = 1, ncol = 2)
Ok–so we’ve visualized ad nauseum. Now, we’ve got to set up a spatial weights matrix. Essentially, something that tells us which of the tracts neighbor each other. In a different setting, we may want to vary weights (greater weight to tracts closer to the central tract, decreasing weights as we move away), but here we just set the weight equal to 1 if the tracts are neighbors and 0 if they are not.
The graph below shows these binary links between tracts.
nyc_nbs = poly2nb(crashes_tracts_geo)
nyc_coords = st_coordinates(st_centroid(crashes_tracts_geo['geometry']))
# Create a spatial points sf object for centroids
centroids_sf <- st_as_sf(data.frame(x = nyc_coords[,1], y = nyc_coords[,2]),
coords = c("x", "y"),
crs = st_crs(crashes_tracts_geo))
# Plot with tmap
spdep_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(col = "white") +
tm_lines(col = "white", lwd = 2) +
# Add points for centroids
tm_shape(centroids_sf) +
tm_dots(size = 0.25, col = "black") +
# Add lines for neighbors
tm_shape(st_as_sf(nb2lines(nyc_nbs, coords = nyc_coords))) +
tm_lines(col = "deepskyblue3")
spdep_map
In the outputs above and below, we can see that there are 2 regions with 0 links (islands), and 4 disjoint subgraphs, meaning networks that there are no paths between (Staten Island is a disjoint subgraph). This was done using the Queen’s case method (think of the squares that a Queen can reach on a chessboard–not only directly bordering neighbors, but diagonals as well.
Morris also recommends checking the rook’s case, so I’m outputing the results from both below:
nyc_nbs_rook = poly2nb(crashes_tracts_geo, queen=FALSE)
cat("*** Queen metric nb summary ***\n\n")
## *** Queen metric nb summary ***
summary(nyc_nbs)
## Neighbour list object:
## Number of regions: 1956
## Number of nonzero links: 10788
## Percentage nonzero weights: 0.2819702
## Average number of links: 5.515337
## 2 regions with no links:
## 259, 740
## 7 disjoint connected subgraphs
## Link number distribution:
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 2 18 53 143 286 466 464 314 149 39 12 6 2 2
## 18 least connected regions:
## 55 257 521 581 582 644 835 839 893 1210 1249 1402 1490 1576 1591 1605 1635 1893 with 1 link
## 2 most connected regions:
## 1718 1785 with 13 links
cat("\n\n*** Rook metric nb summary ***\n\n")
##
##
## *** Rook metric nb summary ***
summary(nyc_nbs_rook)
## Neighbour list object:
## Number of regions: 1956
## Number of nonzero links: 9052
## Percentage nonzero weights: 0.2365957
## Average number of links: 4.627812
## 3 regions with no links:
## 259, 740, 835
## 8 disjoint connected subgraphs
## Link number distribution:
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 3 23 77 283 513 599 311 92 38 9 5 2 1
## 23 least connected regions:
## 55 104 257 391 521 581 582 644 839 893 1210 1249 1402 1421 1453 1490 1518 1576 1591 1605 1635 1726 1893 with 1 link
## 1 most connected region:
## 1718 with 12 links
So…really not much difference. About 1 link, on average, separates the connections for any given tract using the Queen’s case vs. the Rook’s.
Exactly what was needed in this department was a touch confusing to me. Essentially, I got the idea that (a) I should examine whether existing links in the graph made sense (e.g., if there’s a link across water, does it make sense to keep it in from a pedestrian-focused context?) and then (b) make sure the final graph is fully connected, because that’s a requirement of the ICAR model I want to use. In Morris’s example it seems that part (a) was mostly an exercise in spatial data manipulation, whereas the actual dataset used for the final analysis is the original graph with all of its cross-water connections and a few added links to make it a single component. The code below achieves this, and checks my work.
# Function to add a symmetric edge
add_symmetric_edge_nb <- function(nb, i, j) {
nb_res = nb
if (!(j %in% nb[[i]])) {
if (nb[[i]][1] == 0) {
nb_res[[i]] = j
} else {
nb_res[[i]] <- sort(c(nb[[i]], j))
}
}
if (!(i %in% nb[[j]])) {
if (nb[[j]][1] == 0) {
nb_res[[j]] = i
} else {
nb_res[[j]] <- sort(c(nb[[j]], i))
}
}
return(nb_res)
}
nyc_nbs_connected = poly2nb(crashes_tracts_geo)
nyc_nbs_connected = add_symmetric_edge_nb(nyc_nbs_connected,
which(crashes_tracts_geo$BoroCT2020 == 3016400),
which(crashes_tracts_geo$BoroCT2020 == 5001800))
nyc_nbs_connected = add_symmetric_edge_nb(nyc_nbs_connected,
which(crashes_tracts_geo$BoroCT2020 == 4088400),
which(crashes_tracts_geo$BoroCT2020 == 4107201))
nyc_nbs_connected = add_symmetric_edge_nb(nyc_nbs_connected,
which(crashes_tracts_geo$BoroCT2020 == 4107201),
which(crashes_tracts_geo$BoroCT2020 == 4094201))
nyc_nbs_connected = add_symmetric_edge_nb(nyc_nbs_connected,
which(crashes_tracts_geo$BoroCT2020 == 4099801),
which(crashes_tracts_geo$BoroCT2020 == 4103202))
nyc_nbs_connected = add_symmetric_edge_nb(nyc_nbs_connected,
which(crashes_tracts_geo$BoroCT2020 == 2045600),
which(crashes_tracts_geo$BoroCT2020 == 2042600))
nyc_nbs_connected = add_symmetric_edge_nb(nyc_nbs_connected,
which(crashes_tracts_geo$BoroCT2020 == 1029900),
which(crashes_tracts_geo$BoroCT2020 == 2026900))
# Checking Work
spdep_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(col = "white") +
tm_lines(col = "white", lwd = 2) +
# Add points for centroids
tm_shape(centroids_sf) +
tm_dots(size = 0.25, col = "black") +
# Add lines for neighbors
tm_shape(st_as_sf(nb2lines(nyc_nbs_connected, coords = nyc_coords))) +
tm_lines(col = "deepskyblue3")
spdep_map
cat('is symmetric? ', is.symmetric.nb(nyc_nbs_connected), '\n')
## is symmetric? TRUE
summary(nyc_nbs_connected)
## Neighbour list object:
## Number of regions: 1956
## Number of nonzero links: 10800
## Percentage nonzero weights: 0.2822839
## Average number of links: 5.521472
## 7 disjoint connected subgraphs
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 19 52 144 287 461 469 313 150 39 12 6 2 2
## 19 least connected regions:
## 55 257 259 521 581 582 644 835 839 893 1210 1249 1402 1490 1576 1591 1605 1635 1893 with 1 link
## 2 most connected regions:
## 1718 1785 with 13 links
cat('nodes per component')
## nodes per component
table(n.comp.nb(nyc_nbs_connected)$comp.id)
##
## 1
## 1956
We can see here that the graph is symmetric, and that there is only a single component listed, with 1956 nodes. One confusing thing is that there are still 7 disjoint subgraphs in the output. I genuinely don’t know why this is and I cannot figure it out, so if you’re reading this and you do know, please email me.
Finally, the fun part. Now, I know that model building should be an extremely stepwise and iterative process–skipping steps is likely to result in issues. That said, I want to balance this with post length, so I’m roughly planning to include one model + output for each type of model I’m building.
poisson_1_model_file = file.path('stan', 'poisson_1.stan')
cat(readLines(poisson_1_model_file), sep="\n")
## data {
## int<lower=0> N;
## array[N] int<lower=0> y; // count outcomes
## vector<lower=0>[N] E; // exposure
## int<lower=1> K; // num covariates
## matrix[N, K] xs; // design matrix
## }
## transformed data {
## vector[N] log_E = log(E);
##
## // center continuous predictors
## vector[K] means_xs; // column means of xs before centering
## matrix[N, K] xs_centered; // centered version of xs
## for (k in 1:K) {
## means_xs[k] = mean(xs[, k]);
## xs_centered[, k] = xs[, k] - means_xs[k];
## }
## }
## parameters {
## real beta0; // intercept
## vector[K] betas; // covariates
## }
## model {
## y ~ poisson_log(log_E + beta0 + xs_centered * betas);
## beta0 ~ std_normal();
## betas ~ std_normal();
## }
## generated quantities {
## real beta_intercept = beta0 - dot_product(means_xs, betas); // adjust intercept
## array[N] int y_rep;
## vector[N] log_lik;
## {
## vector[N] eta = log_E + beta0 + xs_centered * betas; // centered data
## y_rep = max(eta) < 26 ? poisson_log_rng(eta) : rep_array(-1, N);
## for (n in 1:N) {
## log_lik[n] = poisson_log_lpmf(y[n] | eta[n]);
## }
## }
## }
So, this is not my model, it’s the simple Poisson model that Morris uses. So, rather than understanding it because I wrote it, I’m going to understand it because I explain it to you in this section. First, the data:
The transformed data block takes the log of population (E), which is a generally good principle for data that’s constrained to be all positive. It also means that the coefficient goes to the multiplicative scale, which is nice for interpretation.
In the parameters block, Morris defines an intercept, beta0, and a vector of coefficients, betas, for the covariates. The beta0 coefficient is split out so it can be applied to the log_E data.
The model specifies that y comes from a Poisson distribution, and gives beta0 and betas simple standard normal distributions (mean 0, sd 1).
Lastly, the generated quantities block. In Morris’s words, this is
for use in model checking and model comparison. Y_rep is used to hold
replicated data. Morris uses some optimization in this block that I
don’t entirely (or even really) understand, but this will be used to
conduct posterior predictive checks. That is, we can simulate fake data
and check their characteristics (mean, sd, quantiles) to see if the
model does a good job. Morris uses leave-one-out cross validation, and
uses the output from the loo() function–Expected Log
Predictive Density (ELPD)–which gives the log-probability of new data
given the model. A higher ELPD is better.
In the guide materials, Morris also makes some improvements to the model. Most notably, mean-centering the predictors. I ran the model without these adjustments and then added them. Output was, naturally, pretty similar. The improved model ran in only about 3 seconds per chain on my computer, as compared to 10-12 seconds for the original.
design_mat <- as.data.frame(crashes_tracts_geo) %>%
select(prop_pubtransit, median_income, median_age, frag_index, traffic) %>%
mutate(median_income = log(median_income),
traffic = log(traffic))
pois_data <- list(
N = nrow(crashes_tracts_geo),
y = as.integer(crashes_tracts_geo$tot_crashes),
E = as.integer(crashes_tracts_geo$population),
K = ncol(design_mat),
xs = design_mat
)
poisson_model_1 <- cmdstan_model(poisson_1_model_file)
set.seed(50)
poisson_fit_1 <- poisson_model_1$sample(data=pois_data, parallel_chains=cores, refresh=0)
## Running MCMC with 4 chains, at most 6 in parallel...
##
## Chain 1 finished in 2.0 seconds.
## Chain 2 finished in 2.0 seconds.
## Chain 3 finished in 2.1 seconds.
## Chain 4 finished in 2.3 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 2.1 seconds.
## Total execution time: 2.5 seconds.
pois_summary <- poisson_fit_1$summary()
vars <- c('beta_intercept', 'beta0', 'betas[1]', 'betas[2]', 'betas[3]', 'betas[4]', 'betas[5]')
pois_summary_subset <- pois_summary[pois_summary$variable %in% vars, ]
numeric_cols <- sapply(pois_summary_subset, is.numeric)
pois_summary_subset[numeric_cols] <- round(pois_summary_subset[numeric_cols], 3)
print(pois_summary_subset)
## # A tibble: 7 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 beta0 -4.43 -4.43 0.003 0.003 -4.43 -4.42 1 2937. 2513.
## 2 betas[1] -0.196 -0.196 0.025 0.025 -0.237 -0.156 1.00 1649. 2157.
## 3 betas[2] -0.025 -0.025 0.007 0.007 -0.036 -0.013 1.00 1714. 1762.
## 4 betas[3] -0.006 -0.006 0 0 -0.007 -0.005 1 3763. 3166.
## 5 betas[4] 0.004 0.004 0.001 0.001 0.002 0.007 1.00 4238. 2839.
## 6 betas[5] 0.263 0.263 0.003 0.003 0.257 0.268 1 2838. 2529.
## 7 beta_intercept -6.45 -6.45 0.09 0.088 -6.60 -6.31 1.00 1640. 2038.
Ok, so here we have some output. First, the rhat values are all 1.00, which indicates the model mixed properly. At a basic level, the negative coefficients on the first three predictors (prop_pubtransit, median_income, median_age), indicate that higher levels of public transit use, median income, and median age are all associated with lower counts of crashes in a census tract. Conversely, higher fragmentation index (beta[5]) and traffic volume (beta[6]) are associated with higher counts of crashes. These all pass the sniff test.
This is where we’ll use the y_rep predicted values from the generated
quantities block of the Stan model. We’re hoping that the
characteristics (mean, sd, quantiles) of y_rep match the true data as
closely as possible. Morris included the helper functions
ppc_central_interval and ppc_y_yrep_overlay in
the “utils_dataviz.R” file.
y_rep_pois <- as_draws_matrix(poisson_fit_1$draws("y_rep"))
ppc_central_interval(y_rep_pois, pois_data$y)
## [1] "y total: 1956, ct y is within y_rep central 50% interval: 360, pct: 18.40"
The output above shows that only 18% of the observed data points fall within the 50% predicted interval, which is not great. The graph below illustrates this. It does seem that the model doesn’t handle low or high values all that well.
ppc_y_yrep_overlay(y_rep_pois, pois_data$y,
'Poisson model PPC\ny (blue dot) vs. y_rep (yellow 50% central interval, grey full extent)')
Morris makes the point that the data are overdispersed (we saw that coming when looking at the distribution of the outcome variable earlier), meaning that the variance is higher than we expect (“we” being the users of the single-parameter Poisson model). The recommended solution is to add a simple random effects component to the model, which accounts for per-tract heterogeneity. This is similar to adding indicator variables for each tract, but assumes these varying intercepts come from a distribution.
poisson_2_model_file = file.path('stan', 'poisson_2.stan')
cat(readLines(poisson_2_model_file), sep="\n")
## data {
## int<lower=0> N;
## array[N] int<lower=0> y; // count outcomes
## vector<lower=0>[N] E; // exposure
## int<lower=1> K; // num covariates
## matrix[N, K] xs; // design matrix
## }
## transformed data {
## vector[N] log_E = log(E);
##
## // center continuous predictors
## vector[K] means_xs; // column means of xs before centering
## matrix[N, K] xs_centered; // centered version of xs
## for (k in 1:K) {
## means_xs[k] = mean(xs[, k]);
## xs_centered[, k] = xs[, k] - means_xs[k];
## }
## }
## parameters {
## real beta0; // intercept
## vector[K] betas; // covariates
## vector[N] theta; // heterogeneous random effects
## real<lower=0> sigma; // random effects variance
## }
## model {
## y ~ poisson_log(log_E + beta0 + xs_centered * betas + theta * sigma);
## beta0 ~ std_normal();
## betas ~ std_normal();
## theta ~ std_normal();
## sigma ~ normal(0, 5);
## }
## generated quantities {
## real beta_intercept = beta0 - dot_product(means_xs, betas); // adjust intercept
## array[N] int y_rep;
## vector[N] log_lik;
## {
## vector[N] eta = log_E + beta0 + xs_centered * betas + theta * sigma; // centered data
## y_rep = max(eta) < 26 ? poisson_log_rng(eta) : rep_array(-1, N);
## for (n in 1:N) {
## log_lik[n] = poisson_log_lpmf(y[n] | eta[n]);
## }
## }
## }
So, the difference in the new model is that it includes additional parameters theta and sigma, which define the distribution that the varying intercepts are drawn from.
# Unchanged from previous model
design_mat <- as.data.frame(crashes_tracts_geo) %>%
select(prop_pubtransit, median_income, median_age, frag_index, traffic) %>%
mutate(median_income = log(median_income),
traffic = log(traffic))
pois_data <- list(
N = nrow(crashes_tracts_geo),
y = as.integer(crashes_tracts_geo$tot_crashes),
E = as.integer(crashes_tracts_geo$population),
K = ncol(design_mat),
xs = design_mat
)
poisson_model_2 <- cmdstan_model(poisson_2_model_file)
set.seed(50)
poisson_fit_2 <- poisson_model_2$sample(data=pois_data, parallel_chains=cores, refresh=0)
## Running MCMC with 4 chains, at most 6 in parallel...
##
## Chain 2 finished in 11.1 seconds.
## Chain 3 finished in 15.7 seconds.
## Chain 1 finished in 15.9 seconds.
## Chain 4 finished in 18.2 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 15.2 seconds.
## Total execution time: 18.4 seconds.
pois_summary <- poisson_fit_2$summary()
vars <- c('beta_intercept', 'beta0', 'betas[1]', 'betas[2]', 'betas[3]', 'betas[4]', 'betas[5]', 'sigma')
pois_summary_subset <- pois_summary[pois_summary$variable %in% vars, ]
numeric_cols <- sapply(pois_summary_subset, is.numeric)
pois_summary_subset[numeric_cols] <- round(pois_summary_subset[numeric_cols], 3)
print(pois_summary_subset)
## # A tibble: 8 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 beta0 -4.47 -4.47 0.013 0.014 -4.49 -4.45 1.01 284. 632.
## 2 betas[1] -0.043 -0.04 0.113 0.112 -0.234 0.139 1.02 203. 424.
## 3 betas[2] 0.002 0.002 0.031 0.033 -0.051 0.052 1.01 254. 501.
## 4 betas[3] -0.003 -0.003 0.002 0.002 -0.006 0 1.01 272. 378.
## 5 betas[4] 0.016 0.016 0.006 0.006 0.008 0.026 1.01 243. 459.
## 6 betas[5] 0.265 0.265 0.015 0.016 0.239 0.29 1.00 208. 663.
## 7 sigma 0.593 0.592 0.01 0.01 0.576 0.61 1.01 398. 818.
## 8 beta_intercept -7.01 -7.03 0.4 0.422 -7.62 -6.31 1.01 230. 567.
Alright, the rhat values look fine. Some of the coefficients are slightly different as a result of adding the varying intercepts, but I’m gonna breeze past interpretation for now. Let’s check the model with posterior predictive checks.
y_rep_pois <- as_draws_matrix(poisson_fit_2$draws("y_rep"))
ppc_central_interval(y_rep_pois, pois_data$y)
## [1] "y total: 1956, ct y is within y_rep central 50% interval: 1947, pct: 99.54"
Hmmmm…this is fishy, too. We want about 50% of the observations to fall within the 50% interval, not almost 100%. I guess this model is way over-fit, which probably makes it a good candidate for some leave-one-out cross-validation.
ppc_y_yrep_overlay(y_rep_pois, pois_data$y,
'Poisson model PPC\ny (blue dot) vs. y_rep (yellow 50% central interval, grey full extent)')
So, LOO-CV should allow us to compare the models on how they perform on unseen data.
loo_1_pois <- loo(poisson_fit_1$draws("log_lik"), save_psis = TRUE)
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
loo_2_pois <- loo(poisson_fit_2$draws("log_lik"), save_psis = TRUE)
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
loo_compare(loo_1_pois, loo_2_pois)
## elpd_diff se_diff
## model2 0.0 0.0
## model1 -16787.4 1034.9
The output above identifies the varying-intercepts model as a better fit to the data, which we kind of knew already. I am interested in those Pareto k diagnostic issues:
print("Model 1")
## [1] "Model 1"
pareto_k_table(loo_1_pois)
## Pareto k diagnostic values:
## Count Pct. Min. ESS
## (-Inf, 0.7] (good) 1955 99.9% 124
## (0.7, 1] (bad) 1 0.1% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
print("Model 2")
## [1] "Model 2"
pareto_k_table(loo_2_pois)
## Pareto k diagnostic values:
## Count Pct. Min. ESS
## (-Inf, 0.7] (good) 331 16.9% 147
## (0.7, 1] (bad) 1286 65.7% <NA>
## (1, Inf) (very bad) 339 17.3% <NA>
So, the second model does fit the data better, but the output in the second table tells me that upwards of 80% of the data points in my sample have unreliable LOO estimates. Morris mentions that this model handles overdispersion, but fails to distinguish between variance due to the spatial structure of the data and purely heterogeneous variance. If you think about it, this sort of makes sense. Model 2 doesn’t allow the tracts around a given tract to have any impact on the estimates for that tract, which is going to cause unreliability.